Selection of transform size in video coding

ABSTRACT

A method for decoding an encoded video bitstream includes generating a decoded frame by decoding a current frame from the encoded video bitstream. The decoding of the current frame may include decoding, from the encoded video bitstream, a transform mode for the current frame, identifying the transform mode for a current block of the current frame on a condition that the transform mode for the current frame is a per-block transform mode, using the transform mode for the current frame as the transform mode for the current block on a condition that the transform mode for the current frame is a per-frame transform mode, identifying a prediction mode for the current block, and generating a decoded block for the current block using the prediction mode and the transform mode for the current block.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application is a continuation of U.S. patent application Ser. No. 14/976,438, filed Dec. 21, 2015, which is a divisional of U.S. patent application Ser. No. 13/743,602, filed Jan. 17, 2013, now U.S. Pat. No. 9,219,915.

TECHNICAL FIELD

This disclosure relates to encoding and decoding visual data, such as video stream data, for transmission or storage using intra prediction.

BACKGROUND

Digital video streams typically represent video using a sequence of frames or still images. Each frame can include a number of blocks, which in turn may contain information describing the value of color, brightness or other attributes for pixels. The amount of data in a typical video stream is large, and transmission and storage of video can use significant computing or communications resources. Various approaches have been proposed to reduce the amount of data in video streams, including compression and other encoding techniques.

SUMMARY

This application relates to encoding and decoding of video stream data for transmission or storage. Disclosed herein are aspects of systems, methods, and apparatuses related to selection of transform size in video coding.

An aspect is a method for decoding an encoded video bitstream, including generating a decoded frame by decoding a current frame from the encoded video bitstream. The decoding of the current frame may include decoding, from the encoded video bitstream, a transform mode for the current frame, identifying the transform mode for a current block of the current frame on a condition that the transform mode for the current frame is a per-block transform mode, using the transform mode for the current frame as the transform mode for the current block on a condition that the transform mode for the current frame is a per-frame transform mode, identifying a prediction mode for the current block, and generating a decoded block for the current block using the prediction mode and the transform mode for the current block.

Another aspect is an apparatus for decoding an encoded video bitstream having a plurality of frames. The apparatus may include a memory, and a processor configured to execute instructions stored in the memory to generate a decoded frame by decoding a current frame from the encoded video bitstream, decode, from the encoded video bitstream, a transform mode for the current frame, identify a transform mode for a current block of the current frame on a condition that the transform mode for the current frame is a per-block transform mode, use the transform mode for the current frame as the transform mode for the current block on a condition that the transform mode for the current frame is a per-frame transform mode, identify a prediction mode for the current block, and generate a decoded block for the current block using the prediction mode and the transform mode for the current block.

Another aspect is a method for decoding an encoded video bitstream having a plurality of frames. The method may include decoding, from the encoded video bitstream, a transform mode for a current frame of the plurality of frames, wherein the transform mode includes one of indicating a default transform size for each block of the frame or indicating a per-block transform mode, identifying a transform size for a current block by inspecting a block header of the current block on a condition that the transform mode indicates the per-block transform mode, and decoding the current block using the transform size.

Variations in these and other aspects of this disclosure will be described in additional detail hereafter.

BRIEF DESCRIPTION OF THE DRAWINGS

The description herein makes reference to the accompanying drawings wherein like reference numerals refer to like parts throughout the several views unless otherwise noted.

FIG. 1 is a schematic of a video encoding and decoding system.

FIG. 2 is a block diagram of an exemplary computing device that can implement a transmitting station or a receiving station.

FIG. 3 is a diagram of a typical video stream to be encoded and subsequently decoded.

FIG. 4 is a block diagram of a video compression system in accordance with an implementation of the teachings herein.

FIG. 5 is a block diagram of a video decompression system in accordance with another implementation of the teachings herein.

FIG. 6 is a flowchart of a process for encoding of a video stream using selectable transform sizes according to an aspect of the teachings herein.

FIG. 7 is a flowchart of a process for decoding of a video bitstream using selectable transform sizes according to an aspect of the teachings herein.

FIG. 8 is a diagram showing a frame header and a block header.

DETAILED DESCRIPTION

Digital video encoding is used for various purposes including, for example, remote business meetings via video conferencing, high definition video entertainment, video advertisements, and sharing of user-generated videos. Encoding can include compressing the video stream to reduce required bandwidth by reducing the amount of data that needs to be included in an encoded bitstream.

As will be discussed in more detail in reference to FIG. 3, digital video streams can include frames divided into blocks. Compressing the video stream can be achieved by, for example, encoding the blocks using prediction. Intra prediction, for example, uses data from blocks peripheral to the current block to generate a prediction block. Only the difference between the prediction block and current block is included in the encoded bitstream for later reconstruction of the current block. Larger blocks can be divided into smaller blocks for purposes of improving the accuracy of prediction. In one implementation, a 16×16 macroblock can be divided into four 8×8 blocks such that each 8×8 block is predicted separately. Aspects of disclosed implementations can divide a larger block, such as a 16×16 or 8×8 macroblock or larger, into a series of 4×4 blocks or smaller that can be combined into rectangular sub-blocks that include all of the pixels of the larger block in one and only one sub-block.

As will be discussed in more detail in reference to FIG. 4, residual blocks of video data (e.g., the differences between the current block and prediction block pixels) can be transformed following prediction using any number of transforms, such as a discrete cosine transform (DCT). Transforms may be available in different sizes, such as 64×64, 32×32, 16×16, 8×8, 4×4 or rectangular combinations of 4×4 blocks. Depending on the input, one transform size or mode may encode residual information more efficiently than another. For example, when the video source has a high spatial coherence (e.g., high-definition (HD) material), a 16×16 DCT might be highly efficient. On the other hand, for very complicated blocks, a 4×4 transform can be more efficient.

The transform mode can be tied to the intra prediction size of the blocks. Transform block size can be the same size or smaller than the intra prediction block size. Associating transform block size with intra prediction block size has the benefits of simpler coding schemes and no overhead for signaling transform block size. It can be beneficial in some cases to allow variable transform sizes. In these cases the transform block size can be signaled on a per-segment, per-frame or per-block level. Signaling at the per-block level as opposed to the per-segment or per-frame level, while potentially increasing the accuracy of the coding, can introduce additional bits into the bitstream. For example, for a two-segment frame, the background segment can enforce coupling transform size with prediction size, while the foreground segment can specify selection of transform size on a per-block basis. Choosing an optimal approach by attempting every combination can be computationally infeasible.

According to techniques described herein, the transform block size can be indicated without adding significant computational overhead and without adding a significant number of bits to the bitstream. Aspects of disclosed implementations can accomplish this by indicating a default transform mode in a frame header and then estimating the distortion of transform modes other than the default transform mode. If an estimated transform mode incurs less distortion than the default transform mode, the default transform mode can be updated following encoding the frame for use by subsequent frames. In this fashion the default transform mode can converge to an optimal state after encoding a small number of frames. Details of certain implementations of the teachings herein are described after first discussing environments in which aspects of this disclosure may be implemented.

FIG. 1 is a schematic of a video encoding and decoding system 100. An exemplary transmitting station 112 can be, for example, a computer having an internal configuration of hardware such as that described in FIG. 2. However, other suitable implementations of transmitting station 112 are possible. For example, the processing of transmitting station 112 described herein can be distributed among multiple devices.

A network 128 can connect transmitting station 112 and a receiving station 130 for encoding and subsequent decoding of the video stream. Specifically, the video stream can be encoded in transmitting station 112 and the encoded video stream can be decoded in receiving station 130. Network 128 can be, for example, the Internet. Network 128 can also be a local area network (LAN), wide area network (WAN), virtual private network (VPN), cellular telephone network or any other means of transferring the video stream from transmitting station 112 to, in this example, receiving station 130.

Receiving station 130, in one example, can be a computer having an internal configuration of hardware such as that described in FIG. 2. However, other suitable implementations of receiving station 130 are possible. For example, the processing of receiving station 130 can be distributed among multiple devices.

Other implementations of video encoding and decoding system 100 are possible. For example, an implementation can omit network 128. In another implementation, a video stream can be encoded and then stored for transmission at a later time to receiving station 130 or any other device having memory. In one implementation, receiving station 130 receives (e.g., via network 128, a computer bus, and/or some communication pathway) the encoded video stream and stores the video stream for later decoding. In an exemplary implementation, a real-time transport protocol (RTP) is used for transmission of the encoded video over network 128. In another implementation, a transport protocol other than RTP may be used, e.g., an HTTP-based video streaming protocol.

FIG. 2 is a block diagram of an exemplary computing device 200 that can implement a transmitting station or a receiving station. For example, computing device 200 can implement one or both of transmitting station 112 and receiving station 130 of FIG. 1. Computing device 200 can be in the form of a computing system including multiple computing devices, or in the form of a single computing device, for example, a mobile phone, a tablet computer, a laptop computer, a notebook computer, a desktop computer, and the like.

A CPU 224 in computing device 200 can be a conventional central processing unit. Alternatively, CPU 224 can be any other type of device, or multiple devices, capable of manipulating or processing information now-existing or hereafter developed. Although the implementations described herein can be practiced with a single processor as shown, e.g., CPU 224, advantages in speed and efficiency can be achieved using more than one processor.

A memory 226 in computing device 200 can be a read only memory (ROM) device or a random access memory (RAM) device in an implementation. Any other suitable type of storage device can be used as memory 226. Memory 226 can include code and data 227 that is accessed by CPU 224 using a bus 230. Memory 226 can further include an operating system 232 and application programs 234, the application programs 234 including at least one program that permits CPU 224 to perform the methods described here. For example, application programs 234 can include applications 1 through N, which further include a video encoding application that performs the methods described here. Computing device 200 can also include a secondary storage 236, which can, for example, be a memory card used with a mobile computing device 200. Because the video communication sessions may contain a significant amount of information, they can be stored in whole or in part in secondary storage 236 and loaded into memory 226 as needed for processing.

Computing device 200 can also include one or more output devices, such as a display 228. Display 228 may be, in one example, a touch sensitive display that combines a display with a touch sensitive element that is operable to sense touch inputs. Display 228 can be coupled to CPU 224 via bus 230. Other output devices that permit a user to program or otherwise use computing device 200 can be provided in addition to or as an alternative to display 228. When the output device is or includes a display, the display can be implemented in various ways, including by a liquid crystal display (LCD), a cathode-ray tube (CRT) display or light emitting diode (LED) display, such as an OLED display.

Computing device 200 can also include or be in communication with an image-sensing device 238, for example a camera, or any other image-sensing device 238 now existing or hereafter developed that can sense an image such as the image of a user operating computing device 200. Image-sensing device 238 can be positioned such that it is directed toward the user operating computing device 200. In an example, the position and optical axis of image-sensing device 238 can be configured such that the field of vision includes an area that is directly adjacent to display 228 and from which display 228 is visible.

Computing device 200 can also include or be in communication with a sound-sensing device 240, for example a microphone, or any other sound-sensing device now existing or hereafter developed that can sense sounds near computing device 200. Sound-sensing device 240 can be positioned such that it is directed toward the user operating computing device 200 and can be configured to receive sounds, for example, speech or other utterances, made by the user while the user operates computing device 200.

Although FIG. 2 depicts CPU 224 and memory 226 of computing device 200 as being integrated into a single unit, other configurations can be utilized. The operations of CPU 224 can be distributed across multiple machines (each machine having one or more processors) that can be coupled directly or across a local area or other network. Memory 226 can be distributed across multiple machines such as a network-based memory or memory in multiple machines performing the operations of computing device 200. Although depicted here as a single bus, bus 230 of computing device 200 can be composed of multiple buses. Further, secondary storage 236 can be directly coupled to the other components of computing device 200 or can be accessed via a network and can comprise a single integrated unit such as a memory card or multiple units such as multiple memory cards. Computing device 200 can thus be implemented in a wide variety of configurations.

FIG. 3 is a diagram of an example of a video stream 350 to be encoded and subsequently decoded. Video stream 350 includes a video sequence 352. At the next level, video sequence 352 includes a number of adjacent frames 354. While three frames are depicted as adjacent frames 354, video sequence 352 can include any number of adjacent frames. Adjacent frames 354 can then be further subdivided into individual frames, e.g., a single frame 356. At the next level, a single frame 356 can be divided into a series of segments or planes 358. Segments or planes 358 can be subsets of frames that permit parallel processing, for example. Segments or planes 358 can be subsets of frames that can separate the video data in a frame into separate colors. For example, a frame of color video data can include a luminance plane and two chrominance planes. Segments or planes 358 may be sampled at different resolutions.

Segments or planes 358 can include blocks 360, which may contain data corresponding to, for example, macroblocks of 16×16 or 32×32 pixels in frame 356. Blocks 360 can also be of any other suitable size such as 4×4, 8×8, 16×8 or 8×16 pixels. Blocks 360 can, for example, include pixel data from a luminance plane and two chrominance planes. Unless otherwise noted, the terms block and macroblock are used interchangeably herein.

FIG. 4 is a block diagram of an encoder 470 in accordance with an implementation. Encoder 470 can be implemented, as described above, in transmitting station 112 such as by providing a computer software program stored in memory, for example, memory 226. The computer software program can include machine instructions that, when executed by a processor such as CPU 224, cause transmitting station 112 to encode video data in the manner described in FIG. 4. Encoder 470 can also be implemented as specialized hardware included in, for example, transmitting station 112. Encoder 470 has the following stages to perform the various functions in a forward path (shown by the solid connection lines) to produce an encoded or compressed bitstream 488 using input video stream 350: an intra/inter prediction stage 472, a transform stage 474, a quantization stage 476, and an entropy encoding stage 478. Encoder 470 may also include a reconstruction path (shown by the dotted connection lines) to reconstruct a frame for encoding of future blocks. In FIG. 4, encoder 470 has the following stages to perform the various functions in the reconstruction path: a dequantization stage 480, an inverse transform stage 482, a reconstruction stage 484, and a loop filtering stage 486. Other structural variations of encoder 470 can be used to encode video stream 350.

When video stream 350 is presented for encoding, each frame 356 including planes or segments 358 within the video stream 350 can be processed in units of blocks 360. At the intra/inter prediction stage 472, each block can be encoded using intra-frame prediction (also called intra prediction herein) or inter-frame prediction (also called inter prediction herein). In the case of intra-prediction, a prediction block can be formed from samples in the current frame that have been previously encoded and reconstructed. In the case of inter-prediction, a prediction block can be formed from samples in one or more previously encoded and reconstructed reference frames.

Next, still referring to FIG. 4, the prediction block can be subtracted from the current block at intra/inter prediction stage 472 to produce a residual block (also called a residual). Transform stage 474 transforms the residual into transform coefficients in, for example, the frequency domain. Examples of block-based transforms include the Karhunen-Loève Transform (KLT), the Discrete Cosine Transform (DCT), the Asymmetric Discrete Sine Transform (ADST) and the Singular Value Decomposition Transform (SVD). In one example, the DCT transforms the block into the frequency domain. In the case of DCT, the transform coefficient values are based on spatial frequency, with the lowest frequency (DC) coefficient at the top-left of the matrix and the highest frequency coefficient at the bottom-right of the matrix.
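
By way of illustration only, the coefficient layout described above can be sketched with a naive orthonormal two-dimensional DCT-II in Python. The function name dct2 and the sample block are hypothetical and are not part of the disclosed implementations; a practical encoder would use a fast, typically integer, transform.

```python
import math

def dct2(block):
    """Naive orthonormal 2-D DCT-II of a square block given as a list of rows.

    Illustrative sketch only; 'dct2' is a hypothetical helper name.
    """
    n = len(block)

    def scale(k):
        # Orthonormal scaling factor for frequency index k.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):          # vertical frequency index
        for v in range(n):      # horizontal frequency index
            total = 0.0
            for y in range(n):
                for x in range(n):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs

# A perfectly flat 4x4 residual has energy only in the DC coefficient,
# which lands at the top-left position coeffs[0][0].
flat = [[10] * 4 for _ in range(4)]
coeffs = dct2(flat)
```

For the flat block in this sketch, every coefficient other than the top-left (DC) coefficient is zero up to floating-point rounding, which is why smooth content can favor larger transforms.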

Quantization stage 476 converts the transform coefficients into discrete quantum values, which are referred to as quantized transform coefficients, using a quantizer value or a quantization level. The quantized transform coefficients are then entropy encoded by entropy encoding stage 478. The entropy-encoded coefficients, together with other information used to decode the block, which may include for example the type of prediction used, motion vectors and quantizer value, are then output to compressed bitstream 488. Compressed bitstream 488 can be formatted using various techniques, such as variable length coding (VLC) or arithmetic coding. Compressed bitstream 488 can also be referred to as an encoded video bitstream and the terms will be used interchangeably herein.
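
As a minimal sketch, assuming a single uniform quantizer value applied to every coefficient (the disclosure does not fix a particular quantizer design), quantization and the matching dequantization could look as follows in Python. The names quantize and dequantize are hypothetical.

```python
def quantize(coeffs, q):
    """Map transform coefficients to integer levels with a uniform step q.

    Assumption: one quantizer value for the whole block. Python's round()
    rounds exact halves to the nearest even integer.
    """
    return [[int(round(c / q)) for c in row] for row in coeffs]

def dequantize(levels, q):
    """Approximate the original coefficients from the quantized levels."""
    return [[lvl * q for lvl in row] for row in levels]

coeffs = [[40.0, 3.0], [-5.0, 0.4]]
levels = quantize(coeffs, 4)        # [[10, 1], [-1, 0]]
approx = dequantize(levels, 4)      # [[40, 4], [-4, 0]]
```

The difference between coeffs and approx in this sketch is the quantization error, which contributes to the distortion measured in a rate distortion loop.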

The reconstruction path in FIG. 4 (shown by the dotted connection lines) can be used to ensure that both encoder 470 and a decoder 500 (described below) use the same reference frames to decode compressed bitstream 488. The reconstruction path performs functions that are similar to functions that take place during the decoding process that are discussed in more detail below, including dequantizing the quantized transform coefficients at dequantization stage 480 and inverse transforming the dequantized transform coefficients at inverse transform stage 482 to produce a derivative residual block (also called a derivative residual). At reconstruction stage 484, the prediction block that was predicted at the intra/inter prediction stage 472 can be added to the derivative residual to create a reconstructed block. Loop filtering stage 486 can be applied to the reconstructed block to reduce distortion such as blocking artifacts.

Other variations of encoder 470 can be used to encode compressed bitstream 488. For example, a non-transform based encoder 470 can quantize the residual signal directly without transform stage 474. In another implementation, an encoder 470 can have quantization stage 476 and dequantization stage 480 combined into a single stage.

FIG. 5 is a block diagram of a decoder 500 in accordance with another implementation. Decoder 500 can be implemented in receiving station 130, for example, by providing a computer software program stored in memory 226. The computer software program can include machine instructions that, when executed by a processor such as CPU 224, cause receiving station 130 to decode video data in the manner described in FIG. 5. Decoder 500 can also be implemented in hardware included in, for example, transmitting station 112 or receiving station 130.

Decoder 500, similar to the reconstruction path of encoder 470 discussed above, includes in one example the following stages to perform various functions to produce an output video stream 516 from compressed bitstream 488: an entropy decoding stage 502, a dequantization stage 504, an inverse transform stage 506, an intra/inter prediction stage 508, a reconstruction stage 510, a loop filtering stage 512 and a deblocking filtering stage 514. Other structural variations of decoder 500 can be used to decode compressed bitstream 488.

When compressed bitstream 488 is presented for decoding, the data elements within compressed bitstream 488 can be decoded by entropy decoding stage 502 (using, for example, arithmetic coding) to produce a set of quantized transform coefficients. Dequantization stage 504 dequantizes the quantized transform coefficients, and inverse transform stage 506 inverse transforms the dequantized transform coefficients to produce a derivative residual that can be identical to that created by inverse transform stage 482 in encoder 470. Using header information decoded from compressed bitstream 488, decoder 500 can use intra/inter prediction stage 508 to create the same prediction block as was created in encoder 470, e.g., at intra/inter prediction stage 472. At reconstruction stage 510, the prediction block can be added to the derivative residual to create a reconstructed block. Loop filtering stage 512 can be applied to the reconstructed block to reduce blocking artifacts. Other filtering can be applied to the reconstructed block. For example, deblocking filtering stage 514 can be applied to the reconstructed block to reduce blocking distortion, and the result is output as output video stream 516. Output video stream 516 can also be referred to as a decoded video stream and the terms will be used interchangeably herein.
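
The addition performed at reconstruction stage 510 can be sketched as follows. This is a simplified illustration, not the disclosed decoder: the function name reconstruct is hypothetical, and the clamping of results to the valid sample range for an assumed 8-bit depth is an assumption that the description above does not state.

```python
def reconstruct(prediction, residual, bit_depth=8):
    """Add a derivative residual to a prediction block, sample by sample.

    Assumption: samples are clamped to [0, 2**bit_depth - 1]; the clamp and
    the default bit depth are illustrative choices, not taken from the text.
    """
    hi = (1 << bit_depth) - 1
    return [[max(0, min(hi, p + r)) for p, r in zip(prow, rrow)]
            for prow, rrow in zip(prediction, residual)]

prediction = [[250, 10], [128, 64]]
residual = [[10, -20], [3, -4]]
block = reconstruct(prediction, residual)   # [[255, 0], [131, 60]]
```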

Other variations of decoder 500 can be used to decode compressed bitstream 488. For example, decoder 500 can produce output video stream 516 without deblocking filtering stage 514.

As described briefly above, aspects of disclosed implementations can signal transform size on a per-segment, per-frame or per-block level according to the teachings herein. According to the examples provided herein, the signaling taught herein allows a 4×4, 8×8 or 16×16 or other rectangular transform to be signaled on a per-segment, per-frame or per-block level without testing every combination of transform size and intra prediction block size. This can be accomplished by, firstly, limiting the transform size to the same size or smaller than the block size used in prediction. Then, while processing the block in a rate distortion loop, in addition to calculating the rate distortion for the block using the default transform size, rate distortion values can be estimated for additional transform sizes. Following the rate distortion loop, the estimated rate distortion values can be compared to the calculated rate distortion value, and in cases where the estimated value is less than the calculated value the default transform mode can be updated for use in encoding subsequent frames.
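
The comparison that follows the rate distortion loop can be sketched in Python as shown below. The sketch assumes that per-frame totals are available as plain numbers and that transform modes are identified by strings; the function name update_default_mode and both of these representations are hypothetical.

```python
def update_default_mode(default_mode, calculated_rd, estimated_rd):
    """Return the transform mode to use as the default for subsequent frames.

    default_mode  -- mode used to encode the current frame
    calculated_rd -- rate distortion value calculated with default_mode
    estimated_rd  -- dict mapping other modes to estimated rate distortion values
    (All names and data shapes here are illustrative assumptions.)
    """
    best_mode, best_rd = default_mode, calculated_rd
    for mode, rd in estimated_rd.items():
        if rd < best_rd:    # an estimate beats the best value seen so far
            best_mode, best_rd = mode, rd
    return best_mode

next_mode = update_default_mode("8x8", 100.0, {"4x4": 120.0, "16x16": 90.0})
```

In this sketch the default changes only when an estimate is strictly lower than the calculated value, so ties leave the current default in place.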

Transform mode can indicate either a maximum transform size, such as 16×16, 8×8 or 4×4, or a per-block transform size, where the transform size is determined for each block individually and indicated in a block header associated with the block. In this latter case, several steps incorporated in a rate distortion loop can be used to choose which transform size, and hence transform mode, to include for each block. In the rate distortion loop, the encoder loops through the blocks of a frame and estimates the distortion, or the magnitude of the residual values left following prediction, transformation and quantization for various combinations of intra prediction block sizes and transform sizes, and the rate, or the number of additional bits required in the encoded bitstream to indicate the prediction mode and transform size. Based on this calculation, a transform mode can be chosen.
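
A decoder-side view of the two kinds of transform mode can be sketched as follows. The enumeration values, the dictionary used to stand in for a block header, and the key "transform_size" are hypothetical; the actual bitstream syntax is not specified here. Limiting the result to the prediction block size follows the constraint described above that the transform is the same size as or smaller than the block used in prediction.

```python
from enum import Enum

class TransformMode(Enum):
    # Illustrative values: a positive value is a maximum transform size.
    MAX_4X4 = 4
    MAX_8X8 = 8
    MAX_16X16 = 16
    PER_BLOCK = 0       # size is read from each block header instead

def transform_size_for_block(frame_mode, block_header, prediction_size):
    """Resolve the transform size for one block (hypothetical helper)."""
    if frame_mode is TransformMode.PER_BLOCK:
        size = block_header["transform_size"]   # assumed header field name
    else:
        size = frame_mode.value                 # frame-level maximum size
    # The transform is never larger than the prediction block.
    return min(size, prediction_size)

a = transform_size_for_block(TransformMode.MAX_16X16, {}, 8)
b = transform_size_for_block(TransformMode.PER_BLOCK, {"transform_size": 4}, 16)
```

In this sketch, a is 8 because the frame-level maximum of 16×16 is capped by the 8×8 prediction block, and b is 4 because the per-block mode takes the size from the block header.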

FIG. 6 is a flowchart of a process 600 for encoding a video stream using selectable transform sizes according to an aspect of the teachings herein. Process 600 can be implemented in a system, such as encoder 470, to encode a video stream. Process 600 can be implemented, for example, as a software program that is executed by a computing device such as transmitting station 112 or receiving station 130. The software program can include machine-readable instructions that are stored in a memory such as memory 226 that, when executed by a processor such as CPU 224, cause the computing device to perform process 600. Process 600 can also be implemented using hardware in whole or in part. As explained above, some computing devices may have multiple memories and multiple processors, and the steps of process 600 may in such cases be distributed using different processors and memories. Use of the terms “processor” and “memory” in the singular encompasses computing devices that have only one processor or one memory as well as devices having multiple processors or memories that may each be used in the performance of some but not necessarily all of the recited steps.

For simplicity of explanation, process 600 is depicted and described as a series of steps. However, steps in accordance with this disclosure can occur in various orders and/or concurrently. Additionally, steps in accordance with this disclosure may occur with other steps not presented and described herein. Furthermore, not all illustrated steps may be required to implement a method in accordance with the disclosed subject matter.

Process 600 represents portions of a rate distortion loop. A rate distortion loop is part of a technique for encoding a frame of a video stream. At step 602, a default transform mode for a frame of video data to be encoded is identified. Identified means determined, calculated, discovered, chosen or otherwise identified in any manner whatsoever. The default transform mode can be one of several choices of a maximum transform size for the frame or segment as a whole or can be a mode that selects the transform size at a per-block level. As discussed below, the transform mode can be a maximum size, such as 4×4, 8×8 or 16×16, or the transform mode can specify that a transform size should be identified for each block of the frame independently. This latter transform mode is called a per-block transform mode herein. In the implementations described herein, larger transforms may not be used with smaller prediction modes when operating in the per-block transform mode. For example, a 16×16 transform may not be used with an 8×8 prediction block used for intra prediction since results of encoding and decoding some of the first 8×8 blocks can be used to encode subsequent 8×8 blocks. A user may specify the default transform mode at an initial value in some implementations.

At step 604, the rate distortion loop is started by identifying for processing a block of the frame. Blocks of the frame can be identified for processing in raster scan order, in which the blocks of a frame are identified starting at the upper left corner of the frame and then proceeding along rows of blocks from the top of the frame to the bottom. In one technique of performing a rate distortion loop, a rate distortion multiplier (sometimes referred to as lambda) is used to specify the weight of rate versus distortion, that is, by how much should the distortion (residual signal) decrease to be worth adding a number of bits (e.g., rate) to the compressed bitstream. The loop iterates over multiple available modes, e.g., various intra and inter prediction modes, calculates the cost in bits for coding the block in that way (including the mode and transform coefficients), then calculates the rate (i.e., how many additional bits would coding the block using the mode cost), and then generates the rate/distortion result (also called a rate distortion value herein). Then, the loop selects the mode with the best rate/distortion result. For speed reasons, the loop often applies thresholds to each coding mode so that coding modes are not tested that are not likely to improve the rate/distortion result significantly, and the loop keeps track of mode counts so that the loop will not keep trying the same mode over time if that particular mode is not used.
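
One common way to combine rate and distortion with a multiplier is the Lagrangian cost, distortion plus lambda times rate. The description above does not give a formula, so the cost function in the following Python sketch is an assumption, as are the function name select_mode, the mode labels and the sample numbers.

```python
def select_mode(candidates, lam):
    """Pick the candidate with the lowest cost = distortion + lam * rate.

    candidates -- iterable of (mode, distortion, rate_in_bits) tuples
    lam        -- rate distortion multiplier (lambda)
    Returns a (mode, cost) tuple, or None if there are no candidates.
    """
    best = None
    for mode, distortion, rate in candidates:
        cost = distortion + lam * rate
        if best is None or cost < best[1]:
            best = (mode, cost)
    return best

candidates = [("intra_dc", 500.0, 20), ("intra_h", 420.0, 40), ("inter", 300.0, 90)]
cheap_bits = select_mode(candidates, 2.0)    # ("inter", 480.0)
costly_bits = select_mode(candidates, 5.0)   # ("intra_dc", 600.0)
```

With a small lambda the sketch accepts the extra bits of the low-distortion mode, and with a larger lambda it falls back to the mode that costs the fewest bits, which is the trade-off the multiplier is intended to control.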

More generally, at step 606, process 600 identifies which prediction modes are available for processing the current block. For example, process 600 can identify whether to use intra or inter prediction for the current block. This identification also includes the available prediction block sizes. When the default transform mode is a particular size, for example, the available prediction modes can be limited to those prediction modes using block sizes smaller than or equal to the particular transform size. The available prediction block sizes can also be limited to, for example, system defaults where the default transform mode is the per-block transform mode. The available prediction modes for a current block can also be limited based on thresholds and mode counts described above.
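
The size-based limitation of step 606 can be sketched as a simple filter. In this sketch a default transform size of None stands for the per-block transform mode, and the list of system default sizes is passed in by the caller; both conventions, and the function name, are hypothetical.

```python
def available_prediction_sizes(system_sizes, default_transform_size):
    """List the prediction block sizes to test for the current block.

    system_sizes           -- prediction block sizes supported by default
    default_transform_size -- particular size of the default transform mode,
                              or None for the per-block transform mode
                              (an illustrative convention)
    """
    if default_transform_size is None:
        # Per-block transform mode: fall back to the system defaults.
        return list(system_sizes)
    # Particular size: keep sizes smaller than or equal to that size.
    return [s for s in system_sizes if s <= default_transform_size]

limited = available_prediction_sizes([4, 8, 16], 8)        # [4, 8]
unlimited = available_prediction_sizes([4, 8, 16], None)   # [4, 8, 16]
```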

At next step 608, process 600 selects the intra prediction mode to use in encoding the block. The prediction mode is selected by encoding the block using each available prediction mode identified in step 606 and comparing the resulting rate distortion values in the looping process described above. The prediction mode can include dividing the block into sub-blocks and encoding the sub-blocks. In such a case, the rate distortion values for the sub-blocks can be summed to determine an overall rate distortion value for the block. The rate distortion value is calculated for each block using the default transform mode. As discussed above, one of the available transform modes can include a per-block transform mode. When the default transform mode is the per-block transform mode, the transform size is determined for each block of the frame independently. Step 608 thus tests the available transform sizes (e.g., those the same size or smaller than the prediction block size) and selects the transform size yielding the best rate distortion value.

At step 610, the block (including any sub-blocks) may be encoded using the selected prediction mode and the default transform mode. Note that although process 600 is described as processing for one block at a time through steps 604-616, the processing of steps 604-610 could occur for all blocks of the frame in the scan order, and then the processing of steps 612 and 614 could occur for all blocks of the frame in the scan order after step 610 is completed for the last block of the frame.

Making all available choices of transform size available to all blocks of the frame (as opposed to using a default size or a transform size based on the prediction block size) could increase the number of possible choices in step 608 exponentially (i.e., every possible intra/inter mode in combination with every transform mode). To prevent this, additional steps added to the rate distortion loop test additional transform sizes by estimating the rate distortion value for the different sizes instead of calculating the actual rate distortion values. Estimation differs from calculation in at least two ways. First, the predicted blocks depend upon previously encoded and decoded blocks to calculate a prediction block. To calculate precisely the rate distortion value for a given transform size, the previously encoded and decoded blocks are encoded and decoded using the transform size being tested. Estimating the rate distortion value for a transform size instead uses the results of encoding and decoding the previously processed blocks using the default transform size, thereby reducing computation load. Second, encoding the block data can be stopped following transformation rather than proceeding to quantization. The rate distortion value can be estimated using the partially encoded transform results.

Specifically, process 600 identifies additional transform sizes for a block at step 612. For example, if the default transform mode were a per-frame transform size of 16×16, 8×8 or 4×4, the two transform sizes not used in step 608 are identified in step 612. If the per-block transform mode was indicated, process 600 would have already tested at least one transform size (and possibly more) and saved the rate distortion results for analysis following the rate distortion loop at step 610. The remaining transform sizes as specified by the frame-wide transform modes are then identified in step 612.

Associating a default transform mode for an entire frame can have the benefit of specifying a transform size without requiring that the transform size be specified on a per-block basis, thereby saving bits in the encoded bitstream. If it is determined, for example by performing a rate distortion loop, that per-block transform size identification would save more bits than would be added by specifying the transform size, per-block transform size mode can be specified in the frame header. At low bitrates, some video streams can experience an increase in bitrates by indicating the transform size at the per-block level, as the increased cost in bits in the block headers can outweigh the savings in bits due to greater compression ratios achieved.

This teaching can be applied at step 614 by estimating the rate distortion values for the current block using the additional transform sizes. As discussed above, the rate distortion value is estimated rather than calculated to reduce the computation required. In some embodiments, step 614 can include applying transforms to the residual based on each additional transform size. A measure of the magnitude of the transformed residual block can be made that is related to the final encoded size of the block and can be taken as an estimate of the distortion. The total bit count for the bits required to specify the transform size on a per-block level, where applicable, can be applied to the estimate of the rate. Basically, applying the transform to the previously-generated residual can be used to generate a difference (either positive or negative) to the calculated rate distortion value of step 608.
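One way to picture the estimation in step 614 is to transform the residual at each candidate size and measure the magnitude of the coefficients, stopping before quantization. The sketch below is only an illustration under stated assumptions: it substitutes a Walsh-Hadamard transform for whatever transform an encoder actually uses, takes the sum of absolute coefficients as the magnitude measure, and divides by the tile edge length so that different sizes are comparable. All function names are hypothetical.

```python
def hadamard_1d(values):
    """Unnormalized fast Walsh-Hadamard transform of a power-of-two-length list."""
    out = list(values)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def hadamard_2d(block):
    """Apply the 1-D transform to every row, then to every column."""
    rows = [hadamard_1d(r) for r in block]
    cols = [hadamard_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def transformed_magnitude(residual, size):
    """Sum of absolute coefficients when the residual is tiled at size x size.

    residual: square list of lists whose edge length is a multiple of size.
    Dividing by size matches the scale of an orthonormal 2-D transform.
    """
    n = len(residual)
    total = 0
    for y in range(0, n, size):
        for x in range(0, n, size):
            tile = [row[x:x + size] for row in residual[y:y + size]]
            coeffs = hadamard_2d(tile)
            total += sum(abs(c) for row in coeffs for c in row)
    return total / size


# A flat 8x8 residual concentrates into a single coefficient at 8x8,
# so the larger transform yields the smaller magnitude estimate.
flat = [[1] * 8 for _ in range(8)]
magnitude_8x8 = transformed_magnitude(flat, 8)
magnitude_4x4 = transformed_magnitude(flat, 4)
```

The difference between such estimates for two sizes could then stand in for the positive or negative adjustment to the rate distortion value calculated in step 608, with any per-block signaling bits added to the rate side.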

At step 616, process 600 queries as to whether any more blocks of the frame require processing and either loops back to step 604 to identify another block of the frame for processing or exits the loop. Since step 614 provides information as to how the rate would change given another frame-level setting for transform size, alternative best rate distortion results that cover the hypothetical case of choosing a different transform size setting could be kept. Thus, at the end of the rate distortion loop at 618, two things are provided: a rate distortion result that is the optimal coding mode for the current block given the current transform size setting, and a set of alternative rate distortion results that cover cases where alternative transform size settings were to be used, which may be either better or worse than the actual rate distortion result. By doing this for each block of the whole current frame, the values can be summed to determine which transform size settings would have led to the best coding for the frame, regardless of the actual default transform mode. For example, while iterating through the loop for each block, cumulative rate distortion results by transform mode may be maintained in step 614. For the next frame, if the current setting were not optimal, these results could be used to choose another setting as the default transform mode for the next frame in step 618. The setting chosen is generally that resulting in the lowest cumulative rate distortion result for the entire current frame.
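The accumulation described above can be sketched as follows. The dictionary layout, the mode labels, and the numeric values are hypothetical; the only behavior taken from the text is that per-block results are summed by transform mode and the lowest cumulative total selects the default for the next frame.

```python
from collections import defaultdict


def choose_next_transform_mode(per_block_results):
    """Sum rate distortion values by transform mode and pick the lowest total.

    per_block_results: iterable of dicts, one per block, mapping a
        transform mode label to the calculated or estimated rate
        distortion value for that block under that mode.
    Returns (best_mode, totals).
    """
    totals = defaultdict(float)
    for block in per_block_results:
        for mode, value in block.items():
            totals[mode] += value
    best_mode = min(totals, key=totals.get)
    return best_mode, dict(totals)


# Hypothetical results for a two-block frame.
frame_results = [
    {"4x4": 120, "8x8": 100, "16x16": 130, "per_block": 105},
    {"4x4": 90, "8x8": 95, "16x16": 140, "per_block": 92},
]
next_mode, cumulative = choose_next_transform_mode(frame_results)
```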

In some cases, if it proves to be sufficiently beneficial (e.g., the process is not time constrained and the bit savings are high), it may be worth changing the transform mode and re-doing the rate distortion loop for the current frame using the new setting.

The encoded blocks, along with the default transform mode and per-block transform sizes, where appropriate, are subsequently encoded in the video bitstream. For example, FIG. 8 is a diagram showing a frame 802 of a video stream 800 having a frame header 804. Frame 802 includes one or more blocks 806 having block headers 808. Bits indicating the default transform mode may be included in frame header 804. In cases where the default transform mode is the per-block transform mode, bits indicating which size transform to use may be included in a block header 808 associated with a respective block 806.

Using the techniques described herein, the complexity of choosing an optimal transform mode for frames becomes an integral part of the rate distortion loop to converge to an optimal result within a small number of frames. Associating a default transform mode for an entire frame can have the benefit of specifying a transform size without requiring that the transform size be specified on a per-block basis, thereby saving bits in the encoded bitstream. If it is determined, for example by performing the rate distortion loop, that per-block transform size identification would save more bits than would be added by specifying the transform size, per-block transform size mode can be specified in the frame header. At low bitrates, some video streams can experience an increase in bitrates by indicating the transform size at the per-block level, as the increased cost in bits in the block headers can outweigh the savings in bits due to greater compression ratios achieved. Accordingly, low bitrate streams are likely to choose a single transform size at the frame level, whereas medium or high bitrate streams may significantly benefit from signaling the transform size per block. The techniques taught herein also allow adjustment during the processing of the video stream to take advantage of changes in the captured images that may result in desirable changes to the transform mode.

In order to reduce computational requirements, rate distortion changes for the additional transform sizes do not have to be estimated for each block of each frame. For example, rate distortion changes could be estimated for the additional transform sizes every nth block, where n may be a number between 1 and 100 in an implementation. In this way, the transform mode can be updated for each frame without requiring that rate distortion changes be estimated for each additional transform size for each block. In addition, the default transform mode does not have to be updated for each frame. Aspects of disclosed implementations can estimate the rate distortion for additional transform sizes every mth frame, where m is a number between 1 and 100, for example. Using either or both of these approaches can further reduce computational requirements and thereby speed up the encoding process while still permitting the transform mode to be updated and converge to an optimal result.
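The subsampling just described amounts to a simple gate in front of the estimation step. The following sketch assumes zero-based frame and block indices and hypothetical default values for n and m; neither is specified in the text beyond the stated range of 1 to 100.

```python
def should_estimate(frame_index, block_index, n=4, m=2):
    """Return True when the additional transform sizes should be estimated.

    Estimation runs only on every mth frame and, within such a frame,
    only on every nth block. Indices are assumed to start at zero.
    """
    if not (1 <= n <= 100 and 1 <= m <= 100):
        raise ValueError("n and m are expected to lie between 1 and 100")
    return frame_index % m == 0 and block_index % n == 0


# Blocks of frame 0 that would be estimated when n is 4.
sampled_blocks = [b for b in range(12) if should_estimate(0, b)]
```

Setting both n and m to 1 recovers the case in which every block of every frame is estimated.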

FIG. 7 is a flowchart of a process 700 for decoding a video bitstream using variable sized transforms according to an aspect of the teachings herein. Process 700 can be implemented in a system such as system 500 to decode a video bitstream. Process 700 can be implemented, for example, as a software program that is executed by a computing device such as transmitting station 112 or receiving station 130. The software program can include machine-readable instructions that are stored in a memory such as memory 226 that, when executed by a processor such as CPU 224, cause the computing device to perform process 700. Process 700 can also be implemented using hardware in whole or in part. As explained above, some computing devices may have multiple memories and multiple processors, and the steps of process 700 may in such cases be distributed using different processors and memories.

For simplicity of explanation, process 700 is depicted and described as a series of steps. However, steps in accordance with this disclosure can occur in various orders and/or concurrently. Additionally, steps in accordance with this disclosure may occur with other steps not presented and described herein. Furthermore, not all illustrated steps may be required to implement a method in accordance with the disclosed subject matter.

At step 702, process 700 begins decoding an encoded video bitstream by first identifying the default transform mode. The default transform mode can be indicated by bits included in a frame header associated with the current frame of the encoded video bitstream. In one implementation, the default transform mode can either specify a default transform size, such as 4×4, 8×8 or 16×16, or can specify that transform sizes are included in block headers received by the decoder with the blocks of the frame (e.g., the per-block transform mode).

At step 704, process 700 begins a loop wherein the blocks of the frame are processed by first identifying a current block of the frame for processing in scan order. Blocks may be identified for processing starting at the upper left hand corner of the frame and continuing in raster scan order until the blocks of the frame have been processed. At step 706, if the default transform mode identified at step 702 is the per-block transform mode, process 700 passes to step 708, where the block header is inspected to identify which transform size will be used to inverse transform the block being decoded. If the default transform mode is not the per-block transform mode, process 700 uses the per-frame transform size associated with the default transform mode identified in step 702.

Regardless of the transform mode, at next step 710, the block header is inspected to identify the prediction mode to be used to decode the block. Then, the current block is decoded using the identified prediction mode and the transform size at step 712. Following this, process 700 checks to see if any blocks of the frame remain to be processed at step 714, and if so, returns to step 704 to identify the next block to be decoded. Otherwise, process 700 exits.
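The control flow of steps 702 through 714 can be summarized in a short sketch. It assumes the headers have already been parsed into dictionaries; the key names, the PER_BLOCK marker, and the function name are hypothetical and do not reflect any actual bitstream syntax. The sketch only resolves which prediction mode and transform size apply to each block and leaves the actual inverse transform and reconstruction of step 712 out.

```python
PER_BLOCK = "per_block"  # hypothetical marker for the per-block transform mode


def resolve_block_settings(frame_header, block_headers):
    """Return (prediction_mode, transform_size) for each block in scan order.

    frame_header["transform_mode"] is either PER_BLOCK or an integer edge
    length such as 4, 8 or 16 that applies to every block of the frame.
    """
    default_mode = frame_header["transform_mode"]      # step 702
    settings = []
    for header in block_headers:                        # steps 704 and 714
        if default_mode == PER_BLOCK:                   # step 706
            size = header["transform_size"]             # step 708
        else:
            size = default_mode                         # per-frame size
        prediction_mode = header["prediction_mode"]     # step 710
        settings.append((prediction_mode, size))        # input to step 712
    return settings


# Hypothetical parsed block headers for a two-block frame.
headers = [
    {"prediction_mode": "intra_dc", "transform_size": 4},
    {"prediction_mode": "inter", "transform_size": 16},
]
per_block_settings = resolve_block_settings({"transform_mode": PER_BLOCK}, headers)
per_frame_settings = resolve_block_settings({"transform_mode": 8}, headers)
```

In the per-frame case any transform size carried in a block header is ignored, which mirrors the saving in block header bits discussed above.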

Forcing one transform size can greatly improve or hurt quality depending on the video. The teachings herein allow transform size to be variable without unconditionally imposing such a setting. The default transform mode indicator included in the frame header can be used to either indicate a transform size to be used for the blocks of a frame or to indicate that per-block transform size is to be used. Cases where a single transform size is advantageous may include cases where the video data stream includes generally homogeneous data that changes slowly, e.g., the encoded bitstream has a relatively low bitrate. Cases where per-block transform size is advantageous may include cases where the video data stream includes quickly moving objects and the scene is changing rapidly, leading to a high bitrate data stream, for example. In this latter case, the extra bits associated with indicating a per-block transform size are a much smaller percentage of the video bitstream data. According to the teachings herein, a near-optimal transform size prediction flag setting for given encode/bitrate settings can be selected without having to test each individual transform size setting individually.

Implementations of the teachings herein gain most of the advantages of allowing for variable transform sizes, yet hardly increase decoder complexity. Further, the encoder can choose to set one flag unconditionally if speed is necessary. In alternative implementations, signaling regarding transform mode and/or size can be done at a per-superblock, per-segment, etc., level as opposed to or in addition to at the per-frame and per-block level.

The aspects of encoding and decoding described above illustrate some exemplary encoding and decoding techniques. However, it is to be understood that encoding and decoding, as those terms are used in the claims, could mean compression, decompression, transformation, or any other processing or change of data.

The words “example” or “exemplary” are used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “example” or “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the words “example” or “exemplary” is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or clear from context, “X includes A or B” is intended to mean any of the natural inclusive permutations. That is, if X includes A; X includes B; or X includes both A and B, then “X includes A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form. Moreover, use of the term “an implementation” or “one implementation” throughout is not intended to mean the same embodiment or implementation unless described as such.

Implementations of transmitting station 112 and/or receiving station 130 (and the algorithms, methods, instructions, etc., stored thereon and/or executed thereby, including by encoder 470 and decoder 500) can be realized in hardware, software, or any combination thereof. The hardware can include, for example, computers, intellectual property (IP) cores, application-specific integrated circuits (ASICs), programmable logic arrays, optical processors, programmable logic controllers, microcode, microcontrollers, servers, microprocessors, digital signal processors or any other suitable circuit. In the claims, the term “processor” should be understood as encompassing any of the foregoing hardware, either singly or in combination. The terms “signal” and “data” are used interchangeably. Further, portions of transmitting station 112 and receiving station 130 do not necessarily have to be implemented in the same manner.

Further, in one aspect, for example, transmitting station 112 or receiving station 130 can be implemented using a general purpose computer or general purpose processor with a computer program that, when executed, carries out any of the respective methods, algorithms and/or instructions described herein. In addition or alternatively, for example, a special purpose computer/processor can be utilized which can contain other hardware for carrying out any of the methods, algorithms, or instructions described herein.

Transmitting station 112 and receiving station 130 can, for example, be implemented on computers in a video conferencing system. Alternatively, transmitting station 112 can be implemented on a server and receiving station 130 can be implemented on a device separate from the server, such as a hand-held communications device. In this instance, transmitting station 112 can encode content using an encoder 470 into an encoded video signal and transmit the encoded video signal to the communications device. In turn, the communications device can then decode the encoded video signal using a decoder 500. Alternatively, the communications device can decode content stored locally on the communications device, for example, content that was not transmitted by transmitting station 112. Other suitable transmitting station 112 and receiving station 130 implementation schemes are available. For example, receiving station 130 can be a generally stationary personal computer rather than a portable communications device and/or a device including an encoder 470 may also include a decoder 500.

Further, all or a portion of implementations of the present invention can take the form of a computer program product accessible from, for example, a tangible computer-usable or computer-readable medium. A computer-usable or computer-readable medium can be any device that can, for example, tangibly contain, store, communicate, or transport the program for use by or in connection with any processor. The medium can be, for example, an electronic, magnetic, optical, electromagnetic, or a semiconductor device. Other suitable mediums are also available.

The above-described embodiments, implementations and aspects have been described in order to allow easy understanding of the present invention and do not limit the present invention. On the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims, which scope is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structure as is permitted under the law.

What is claimed is:
 1. A method for decoding an encoded video bitstream, the method comprising: generating, by a processor in response to instructions stored on a non-transitory computer readable medium, a decoded frame by decoding a current frame from the encoded video bitstream, wherein decoding the current frame includes: decoding, from the encoded video bitstream, a first value for decoding the current frame; in response to a determination that the first value indicates the use of per-block transform signaling, identifying, based on information other than the first value, a transform for decoding a current block of the current frame; in response to a determination that the first value indicates the use of per-frame transform signaling, using the first value as an identifier of the transform; identifying a prediction mode for the current block; and generating a decoded block for the current block using the prediction mode and the transform.
 2. The method of claim 1, wherein the first value is identified by inspecting bits included in a frame header of the current frame.
 3. The method of claim 1, further comprising: identifying an updated transform for a subsequent frame relative to the current frame, the updated transform based on the first value, calculated rate distortion values, and estimated rate distortion values.
 4. The method of claim 1, wherein identifying the transform includes: identifying a transform size of the transform by inspecting block headers for the current block.
 5. The method of claim 1, wherein the prediction mode is one of a plurality of prediction modes using block sizes equal to or smaller than a block size of the transform.
 6. The method of claim 1, wherein the first value includes an indication of a maximum transform size for each block of the current frame.
 7. The method of claim 1, wherein identifying the transform includes identifying an indication of a transform size being one of 16×16, 8×8, or 4×4.
 8. An apparatus for decoding an encoded video bitstream having a plurality of frames, the apparatus comprising: a memory; and a processor configured to execute instructions stored in the memory to: decode, from the encoded video bitstream, a first value for decoding a current frame from the encoded video bitstream; in response to a determination that the first value indicates the use of per-block transform signaling, identify, based on information other than the first value, a transform for decoding a current block of the current frame; in response to a determination that the first value indicates the use of per-frame transform signaling, use the first value as an identifier of the transform; identify a prediction mode for the current block; and generate a decoded block for the current block using the prediction mode and the transform.
 9. The apparatus of claim 8, wherein the processor is configured to execute instructions stored in the memory to: identify the first value by inspecting bits included in a frame header of the current frame.
 10. The apparatus of claim 8, wherein the processor is configured to execute instructions stored in the memory to: identify an updated transform for a subsequent frame relative to the current frame, the updated transform based on the first value, calculated rate distortion values, and estimated rate distortion values.
 11. The apparatus of claim 8, wherein the processor is configured to execute instructions stored in the memory to: identify a transform size of the transform for each block in the frame by inspecting respective block headers in response to the determination that the first value indicates the use of per-block transform signaling.
 12. The apparatus of claim 8, wherein the prediction mode is one of a plurality of prediction modes using block sizes equal to or smaller than a block size of the transform.
 13. The apparatus of claim 8, wherein the first value includes an indication of a maximum transform size for all blocks of the current frame.
 14. A method for decoding an encoded video bitstream having a plurality of frames, the method comprising: generating, by a processor in response to instructions stored on a non-transitory computer readable medium, a decoded frame by decoding a current frame from the encoded video bitstream, wherein decoding the current frame includes: decoding, from the encoded video bitstream, a transform mode signaling type identifier for the current frame, wherein the transform mode signaling type identifier indicates the use of per-frame transform mode signaling and identifies a transform size, or the transform mode signaling type identifier indicates the use of per-block transform mode signaling and omits the transform size; in response to a determination that the transform mode signaling type identifier indicates the use of per-block transform mode signaling, identifying the transform size by inspecting a block header of a current block; and decoding the current block using a transform having the transform size.
 15. The method of claim 14, wherein the transform mode signaling type identifier for the current frame is identified by inspecting bits included in a frame header of the current frame.
 16. The method of claim 14, wherein on a condition that the transform signaling type identifier indicates the transform size, the transform size represents a maximum transform size for each block of the current frame.
 17. The method of claim 14, wherein the transform mode signaling type identifier indicates a 16×16 transform size, a 8×8 transform size, or a 4×4 transform size.